Abstract

This paper describes classification of typed student utterances within
AutoTutor, an intelligent tutoring system. Utterances are classified to
one of 18 categories, including 16 question categories. The classifier
presented uses part of speech tagging, cascaded finite state
transducers, and simple disambiguation rules. Shallow NLP is well
suited to the task: session log file analysis reveals significant
classification of eleven question categories, frozen expressions, and
assertions.

1  Introduction 

AutoTutor is a domain-portable intelligent tutoring system (ITS) with
current versions in the domains of physics and computer literacy
(Graesser et al. 1999; Olney et al. 2002). AutoTutor, like many other
ITSs, is an intersection of applications, including tutoring,
mixed-initiative dialogue, and question answering. In each of these,
utterance classification, particularly question classification, plays a
critical role.

In tutoring, utterance classification can be used to track the
student's level of understanding. Contribution and question
classifications can both play a role: contributions may be compared to
an expected answer (Graesser et al. 2001) and questions may be scored
by how "deep" they are. For example, the PREG model (Otero and Graesser
2001) predicts under what circumstances students will ask "deep"
questions, i.e. those that reveal a greater level of cognitive
processing than who, what, when, or where questions. A student who is
only asking shallow questions, or no questions at all, is predicted by
PREG to not have a situation-level understanding (van Dijk and Kintsch
1983) and thus to learn less and forget faster. The key point is that
different metrics for tracking student understanding are applicable to
questions and contributions. Distinguishing them via classification is
a first step to applying a metric.

In mixed-initiative dialog systems, utterance classification can be
used to detect shifts in initiative. For example, a mixed-initiative
system that asks, "Where would you like to travel", could respond to
the question, "Where can I travel for $200?" (Allen 1999) by giving a
list of cities. In this example, the user is taking the initiative by
requesting more information. To respond appropriately, the system must
first detect that the user has taken initiative; otherwise it might try
to interpret the user's utterance as a travel destination. In this
sense, questions mark redirection of the dialogue, whereas
contributions are continuations of the dialogue. In order for a user to
redirect the dialogue and thus exercise initiative, a mixed-initiative
system must be able to distinguish questions and contributions.

Question classification as early as Lehnert (1978) has been used as a
basis for answering questions, a trend that continues today (Voorhees
2001). A common feature of these question-answering systems is that
they first determine the expected answer type implicit in the question.
For example, "How much does a pretzel cost" might be classified
according to the answer type of MONEY or QUANTITY. Knowledge of the
expected answer type can be used to narrow the search space for the
answer, either online (Brill et al. 2001) or in a database (Harabagiu
et al. 2000). Accordingly, question answering calls for a finer
discrimination of question types as opposed to only distinguishing
questions from contributions.

AutoTutor uses utterance classification to track student progress, to
determine initiative, and to answer questions. By virtue of being
embedded in AutoTutor, the utterance classifier presented here has an
unusual set of constraints, both practical and theoretical. On the
practical side, AutoTutor is a web-based application that performs in
real time; thus utterance classification must also proceed in real
time. For that reason, the classifier uses a minimum of resources,
including part of speech tagging (Brill 1995; Sekine and Grishman 1995)
and cascaded finite state transducers defining the categories.
Theoretically speaking, AutoTutor must also recognize questions in a
way that is meaningful to both question answering and tutoring. The
question taxonomy utilized, that of Graesser et al. (1992), is an
extension of Lehnert's (1978) taxonomy for question answering and has
been applied to human tutoring (Graesser et al. 1992; Graesser and
Person 1994).

This paper outlines the utterance classifier and quantifies its
performance. In particular, Section 2 presents AutoTutor. Section 3
presents the utterance taxonomy. Section 4 describes the classifier
algorithm. Section 5 delineates the training process and results.
Section 6 presents evaluation of the classifier on real AutoTutor
sessions. Section 7 concludes the paper.

2  AutoTutor 

AutoTutor is an ITS applicable to any content domain. Two distinct
domain applications of AutoTutor are available on the Internet, for
computer literacy and conceptual physics. The computer literacy
AutoTutor, which has now been used in experimental evaluations by over
200 students, tutors students on core computer literacy topics covered
in an introductory course, such as operating systems, the Internet, and
hardware. The topics covered by the physics AutoTutor are grounded in
basic Newtonian mechanics and are of a similar introductory nature. It
has been well documented that AutoTutor promotes learning gains in both
versions (Person et al. 2001).

AutoTutor simulates the dialog patterns and pedagogical strategies of
human tutors in a conversational interface that supports
mixed-initiative dialog. AutoTutor's architecture comprises seven
highly modular components: (1) an animated agent, (2) a curriculum
script, (3) a speech act classifier, (4) latent semantic analysis
(LSA), (5) a dialog move generator, (6) a Dialog Advancer Network, and
(7) a question-answering tool (Graesser et al. 1998; Graesser et al.
2001; Graesser et al. 2001; Person et al. 2000; Person et al. 2001;
Wiemer-Hastings et al. 1998).

A tutoring session begins with a brief introduction from AutoTutor's
three-dimensional animated agent. AutoTutor then asks the student a
question from one of the topics in the curriculum script. The
curriculum script contains lesson-specific tutor-initiated dialog,
including important concepts, questions, cases, and problems (Graesser
and Person 1994; Graesser et al. 1995; McArthur et al. 1990; Putnam
1987). The student submits a response to the question by typing and
pressing the "Submit" button. The student's contribution is then
segmented, parsed (Sekine and Grishman 1995) and sent through a
rule-based utterance classifier. The classification process makes use
of only the contribution text and part-of-speech tags provided by the
parser.

Mixed-initiative dialog starts with utterance classification and ends
with dialog move generation, which can include question answering,
repeating the question for the student, or just encouraging the
student. Concurrently, the LSA module evaluates the quality of the
student contributions, and in the tutor-initiative mode, the dialog
move generator selects one or a combination of specific dialog moves
that is both conversationally and pedagogically appropriate (Person et
al. 2000; Person et al. 2001). The Dialog Advancer Network (DAN) is the
intermediary of dialog move generation in all instances, using
information from the speech act classifier and LSA to select the next
dialog move type and appropriate discourse markers. The dialog move
generator selects the actual move. There are twelve types of dialog
move: Pump, Hint, Splice, Prompt, Prompt Response, Elaboration,
Summary, and five forms of immediate short-feedback (Graesser and
Person 1994; Graesser et al. 1995; Person and Graesser 1999).

3  An  utterance  taxonomy 

The framework for utterance classification in Table 1 is familiar from
taxonomies in the cognitive sciences (Graesser et al. 1992; Graesser
and Person 1994). The most notable system within this framework is
QUALM (Lehnert 1978), which utilizes twelve of the question categories.
The taxonomy can be divided into three distinct groups: questions,
frozen expressions, and contributions. Each of these will be discussed
in turn.

The conceptual basis of the question categories arises from the
observation that the same question may be asked in different ways, e.g.
"What happened?" and "How did this happen?" Correspondingly, a single
lexical stem for a question, like "What", can be polysemous, e.g. both
in a definition category, "What is the definition of gravity?" and
metacommunicative, "What did you say?" Furthermore, implicit questions
can arise in tutoring via directives and some assertions, e.g. "Tell me
about gravity" and "I don't know what gravity is." In AutoTutor these
information-seeking utterances are classified to one of the 16 question
categories.

The emphases on queried concepts rather than orthographic forms make
the categories listed in Table 1 bear a strong resemblance to speech
acts. Indeed, Graesser et al. (1992) propose that the categories be
distinguished in precisely the same way as speech acts, using semantic,
conceptual, and pragmatic criteria as opposed to syntactic and lexical
criteria. Speech acts presumably transcend these surface criteria: it
is not so much what is being said as what is done by the saying
(Austin, 1962; Searle, 1975).


Category                 Example

Questions
Verification             Does the pumpkin land in his hands?
Disjunctive              Is the pumpkin accelerating or decelerating?
Concept Completion       Where will the pumpkin land?
Feature Specification    What are the components of the forces acting on the pumpkin?
Quantification           How far will the pumpkin travel?
Definition               What is acceleration?
Example                  What is an example of Newton's Third Law?
Comparison               What is the difference between speed and velocity?
Interpretation           What is happening in this situation with the runner and pumpkin?
Causal Antecedent        What caused the pumpkin to fall?
Causal Consequence       What happens when the runner speeds up?
Goal Orientation         Why did you ignore air resistance?
Instrumental/Procedural  How do you calculate force?
Enablement               What principle allows you to ignore the vertical component of the force?
Expectational            Why doesn't the pumpkin land behind the runner?
Judgmental               What do you think of my explanation?

Frozen Expressions
Metacognitive            I don't understand.
Metacommunicative        Could you repeat that?

Contribution             The pumpkin will land in the runner's hands

Table 1. AutoTutor's utterance taxonomy.

The close relation to speech acts underscores what a difficult task
classifying conceptual questions can be. Jurafsky and Martin (2000)
describe the problem of interpreting speech acts using pragmatic and
semantic inference as AI-complete, i.e. impossible without creating a
full artificial intelligence. The alternative explored in this paper is
cue or surface-based classification, using no context.

It is particularly pertinent to the present discussion that the sixteen
qualitative categories are employed in a quantitative classification
process. That is to say that for the present purposes of
classification, a question must belong to one and only one category. On
the one hand this idealization is necessary to obtain easily analyzed
performance data and to create a well-balanced training corpus. On the
other hand, it is not entirely accurate because some questions may be
assigned to multiple categories, suggesting a polythetic coding scheme
(Graesser et al. 1992). Inter-rater reliability is used in the current
study as a benchmark to gauge this potential effect.

Frozen expressions consist of metacognitive and metacommunicative
utterances. Metacognitive utterances describe the cognitive state of
the student, and they therefore require a different response than
questions or assertions. AutoTutor responds to metacognitive utterances
with canned expressions such as, "Why don't you give me what you know,
and we'll take it from there." Metacommunicative acts likewise refer to
the dialogue between tutor and student, often calling for a repetition
of the tutor's last utterance. Two key points are worth noting: frozen
expressions have a much smaller variability than questions or
contributions, and frozen expressions may be followed by some content,
making them more properly treated as questions. For example, "I don't
understand" is frozen, but "I don't understand gravity" is more
appropriately a question.

Contributions in the taxonomy can be viewed as anything that is not
frozen or a question; in fact, that is essentially how the classifier
works. Contributions in AutoTutor, either as responses to questions or
unprompted, are tracked to evaluate student performance via LSA,
forming the basis for feedback.

4  Classifier  Algorithm

The present approach ignores the semantic and pragmatic context of the
questions, and utilizes surface features to classify questions. This
shallow approach parallels work in question answering (Srihari and Li
2000; Soubbotin and Soubbotin 2002; Moldovan et al. 1999).
Specifically, the classifier uses tagging provided by ApplePie (Sekine
and Grishman 1995) followed by cascaded finite state transducers
defining the categories. The finite state transducers are roughly
described in Table 2. Every transducer is given a chance to match, and
a disambiguation routine is applied at the end to select a single
category.



Immediately after tagging, transducers are applied to check for frozen
expressions. A frozen expression must match, and the utterance must be
free of any nouns, i.e. not frozen+content, for the utterance to be
classified as frozen. Next the utterance is checked for question stems,
e.g. WHAT, HOW, WHY, etc. and question mark punctuation. If question
stems are buried in the utterance, e.g. "I don't know what gravity is",
a movement rule transforms the utterance, placing the stem at the
beginning. Likewise if a question ends with a question mark but has no
stem, an AUX stem is placed at the beginning of the utterance. In this
way the same transducers can be applied to both direct and indirect
questions. At this stage, if the utterance does not possess a question
stem and is not followed by a question mark, the utterance is
classified as a contribution.
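
The preprocessing stage described above can be sketched roughly as follows. This is a minimal illustration, not AutoTutor's implementation: the stem list, the frozen phrases, the noun tag set, and the function name preclassify are all assumptions made for the example, and real frozen-expression matching is done by transducers rather than by string prefix tests.

```python
# Hypothetical word lists for illustration; the paper does not enumerate them.
QUESTION_STEMS = {"what", "how", "why", "who", "when", "where"}
FROZEN_PHRASES = ("i don't understand", "could you repeat that")
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}  # assumed Penn Treebank noun tags


def preclassify(tagged):
    """Take a list of (word, tag) pairs and return (label, words).

    label is 'frozen', 'contribution', or 'question'; for questions the
    returned words have the stem moved (or an AUX stem added) up front.
    """
    words = [word.lower() for word, _ in tagged]
    text = " ".join(words)

    # Frozen: a frozen phrase matches and the utterance contains no nouns.
    has_noun = any(tag in NOUN_TAGS for _, tag in tagged)
    if not has_noun and text.startswith(FROZEN_PHRASES):
        return "frozen", words

    # Movement rule: a stem buried in the utterance goes to the front.
    for i, word in enumerate(words):
        if word in QUESTION_STEMS:
            return "question", [word] + words[:i] + words[i + 1:]

    # No stem but a final question mark: prepend an AUX stem.
    if words and words[-1] == "?":
        return "question", ["AUX"] + words

    # Neither stem nor question mark: default to contribution.
    return "contribution", words
```

Note that a sketch like this reconstructs a question from any utterance containing a stem word, so a contribution such as "The ball will land when ..." would also be routed to the question transducers.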

Two sets of finite state transducers are applied to potential
questions, keyword transducers and syntactic pattern transducers.
Keyword transducers replace a set of keywords specific to a category
with a symbol for that category. This extra step simplifies the
syntactic pattern transducers that look for the category symbol in
their pattern. The definition keyword transducer, for example, replaces
"definition", "define", "meaning", "means", and "understanding" with
"KEYDEF". For most categories, the keyword list is quite extensive and
exceeds the space limitations of Table 2. Keyword transducers also add
the category symbol to a list when they match; this list is used for
disambiguation. Syntactic pattern transducers likewise match, putting a
category symbol on a separate disambiguation list.
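
As a rough illustration of the two transducer sets, the sketch below emulates them with a keyword lookup and a regular expression over space-joined tokens. Only the definition keywords and the KEYDEF symbol come from the text; the function names, the auxiliary list, and the regex rendering of a "What AUX ... keyword" pattern are assumptions for the example.

```python
import re

# Category -> (symbol, keywords). The definition entry follows the text;
# a real keyword list would cover every category and be far longer.
KEYWORDS = {
    "definition": ("KEYDEF", {"definition", "define", "meaning", "means", "understanding"}),
}

# Hypothetical rendering of a "^What AUX ... keyword" pattern as a regex
# over space-joined tokens; the auxiliary list is an assumption.
AUX = r"(?:is|are|was|were|do|does|did)"
SYNTACTIC_PATTERNS = {
    "definition": re.compile(rf"^what {AUX} .*\bKEYDEF\b"),
}


def apply_keyword_transducers(words):
    """Replace category keywords with symbols; return (words, keyword_hits)."""
    out = [w.lower() for w in words]
    keyword_hits = []  # first disambiguation list, in matching order
    for category, (symbol, keywords) in KEYWORDS.items():
        replaced = [symbol if w in keywords else w for w in out]
        if replaced != out:
            keyword_hits.append(category)
        out = replaced
    return out, keyword_hits


def apply_syntactic_transducers(words):
    """Return the second disambiguation list: categories whose pattern matches."""
    text = " ".join(words)
    return [cat for cat, pattern in SYNTACTIC_PATTERNS.items() if pattern.search(text)]
```

Because the keyword pass runs first, the syntactic pattern only has to mention one symbol rather than every keyword in the category.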

In the disambiguation routine, both lists are consulted, and the first
category symbol found on both lists determines the classification of
the utterance. Clearly ordering of transducers affects which symbols
are closest to the beginning of the list. Ordering is particularly
relevant when considering categories like concept completion, which
match more freely than other categories. Ordering gives rarer and
stricter categories a chance to match first; this strategy is common in
stemming (Paice 1990).

Utterance Category       Finite state transducer pattern

Verification             ^AUX
Disjunctive              ^AUX ... or
Concept Completion       ^(Who|What|When|Where)
Feature Specification    ^What ... keyword
                         keyword
Quantification           ^What AUX ... keyword
                         ^How (ADJ|ADV)
                         ^MODAL you ... keyword
Definition               ^What AUX ... (keyword|a? (ADJ|ADV)* N)
                         ^MODAL you ... keyword
                         what a? (ADJ|ADV)* N BE
Example                  ^AUX ... keyword
                         ^What AUX ... keyword
Comparison               ^What AUX ... keyword
                         ^How ... keyword
                         ^MODAL you ... keyword
Interpretation           keyword
Causal Antecedent        ^(Why|How) AUX ... (VBpast|keyword)
Causal Consequence       ^(WH|How) ... keyword
Goal Orientation         ^(What|Why) AUX ART? (NP|SUBJPRO|keyword)
                         ^What ... keyword
Instrumental/Procedural  ^(WH|How) AUX ART? (N|PRO)
                         ^(WH|How) ... keyword
                         ^MODAL you ... keyword
Enablement               ^(WH|How) ... keyword
Expectational            ^Why AUX ... NEG
Judgmental               (you|your) ... keyword
                         (should|keyword) (N|PRO)
Frozen (no nouns)        ^SUBJPRO ... keyword
                         ^VB ... keyword ... OBJPRO
                         ^AUX ... SUBJPRO ... keyword
Contribution             Everything else

Table 2. Finite state transducer patterns
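
The disambiguation step can be sketched in a few lines. The function name disambiguate is hypothetical, and the fallback for categories that have no keyword list (such as verification) is an assumption; the paper states only that the first symbol found on both lists wins.

```python
def disambiguate(keyword_hits, pattern_hits):
    """Pick one category from the two disambiguation lists.

    Both lists are assumed to be ordered by transducer order, so rarer
    and stricter categories appear before freely matching ones.
    """
    keyword_set = set(keyword_hits)
    # First category symbol that appears on both lists.
    for category in pattern_hits:
        if category in keyword_set:
            return category
    # Assumption, not stated in the paper: categories defined by a
    # syntactic pattern alone fall back to the first pattern match.
    if pattern_hits:
        return pattern_hits[0]
    return None
```

With ordered lists, a freely matching category such as concept completion is chosen only when no stricter category earlier in the list is supported by both a keyword and a pattern.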

5  Training 

The classifier was built by hand in a cyclical process of inspecting
questions, inducing rules, and testing the results. The training data
was derived from brainstorming sessions whose goal was to generate
questions as lexically and syntactically distinct as possible. Of the
brainstormed questions, only when all five raters agreed on the
category was a question used for training; this approach filtered out
polythetic questions and left only archetypes.

Intuitive analysis suggested that the majority of questions have at
most a two-part pattern consisting of a syntactic template and/or a
keyword identifiable for that category. A trivial example is
disjunction, whose syntactic template is auxiliary-initial and
corresponding keyword is "or". Other categories were similarly defined
either by one or more patterns of initial constituents, or a keyword,
or both. To promote generalizability, extra care was given not to
overfit the training data. Specifically, keywords or syntactic patterns
were only used to define categories when they occurred more than once
or were judged highly diagnostic.


                         Expert
                         present    ¬present
Classifier   present     tp         fp
             ¬present    fn         tn

Table 3. Contingency Table.

The results of the training process are shown in Table 4. Results from
each category were compiled in 2 x 2 contingency tables like Table 3,
where tp stands for "true positive" and fn for "false negative".

Recall, fallout, precision, and F-measure were calculated in the
following way for each category:

\[ \text{Recall} = \frac{tp}{tp + fn} \]

\[ \text{Fallout} = \frac{fp}{fp + tn} \]

\[ \text{Precision} = \frac{tp}{tp + fp} \]

\[ \text{F-measure} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}} \]
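
The four measures follow directly from the cells of a category's contingency table. The sketch below is illustrative; the function names are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention rather than something the paper specifies.

```python
def f_measure(recall, precision):
    """Harmonic mean of recall and precision (0.0 if both are zero)."""
    total = recall + precision
    return 2 * recall * precision / total if total else 0.0


def classification_metrics(tp, fp, fn, tn):
    """Compute recall, fallout, precision, and F-measure from a 2 x 2 table."""

    def ratio(numerator, denominator):
        # Assumed convention: an empty denominator yields 0.0.
        return numerator / denominator if denominator else 0.0

    recall = ratio(tp, tp + fn)
    fallout = ratio(fp, fp + tn)
    precision = ratio(tp, tp + fp)
    return {
        "recall": recall,
        "fallout": fallout,
        "precision": precision,
        "f_measure": f_measure(recall, precision),
    }
```

As a consistency check against Table 4, f_measure(0.983, 0.999) rounds to 0.991, the value reported for contributions.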

Recall and fallout are often used in signal detection analysis to
calculate a measure called d' (Green and Swets 1966). Under this
analysis, the performance of the classifier is significantly more
favorable than under the F-measure, principally because the fallout, or
false alarm rate, is so low. Both in training and evaluation, however,
the data violate assumptions of normality that d' requires.
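
For reference, d' is conventionally computed as the difference between the z-transformed hit rate (recall) and false alarm rate (fallout). The sketch below uses the standard library's normal distribution; the function name is hypothetical, and the paper does not report d' values.

```python
from statistics import NormalDist


def d_prime(hit_rate, false_alarm_rate):
    """Return d' = z(hit rate) - z(false alarm rate).

    Both rates must lie strictly between 0 and 1; inv_cdf raises
    StatisticsError otherwise, so cells of exactly 0.000 or 1.000
    would need a correction before this could be applied.
    """
    z = NormalDist().inv_cdf  # inverse CDF of the standard normal
    return z(hit_rate) - z(false_alarm_rate)
```

A very low fallout pushes its z-score far into the negative tail, which is why d' looks favorable even when precision is modest.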

As explained in Section 3, a contribution classification is the default
when no other classification can be given. As such, no training data
was created for contributions. Likewise frozen expressions were judged
to be essentially a closed class of phrases and do not require
training. Absence of training results for these categories is
represented by double stars in Table 4.

During the training process, the classifier was never tested on unseen
data. A number of factors make it difficult to obtain questions
suitable for testing purposes. Brainstormed questions are an unreliable
source of testing data because they are not randomly sampled. In
general, corpora proved to be an unsatisfactory source of questions due
to low inter-rater reliability and skewed distribution of categories.

Low inter-rater reliability often could be traced to anaphora and
pragmatic context. For example, the question "Do you know what the
concept of group cell is?" might license a definition or verification,
depending on the common ground. "Do you know what it is?" could equally
license a number of categories, depending on the referent of "it". Such
questions are clearly beyond the scope of a classifier that does not
use context.

The skewed distribution of the question categories and their
infrequency necessitate use of an extraction algorithm to locate them.
Simply looking for question marks is not enough: our estimates predict
that raters would need to classify more than 5,000 questions extracted
from the Wall Street Journal this way to get a mere 20 instances of the
rarest types. A bootstrapping approach using machine learning is a
possible alternative that will be explored in the future (Abney 2002).

Regardless of these difficulties, the strongest evaluation results from
using the classifier in a real world task, with real world data.

6  Evaluation 

The classifier was used in AutoTutor sessions throughout the year of
2002. The log files from these sessions contained 9094 student
utterances, each of which was classified by an expert. The expert
ratings were compared to the classifier's ratings, forming a 2 x 2
contingency table for each category as in Table 3.

To expedite ratings, utterances extracted from the log files were split
into two groups, contributions and non-contributions, according to
their logged classification. Expert judges were assigned to a group and
instructed to classify a set of utterances to one of the 18 categories.
Though inter-rater reliability using the kappa statistic (Carletta
1996) may be calculated for each group, the distribution of categories
in the contribution group was highly skewed and warrants further
discussion.

                        Training Data                         AutoTutor Performance
CATEGORY                Recall   Fallout  Precision F-measure Recall   Fallout  Precision F-measure Likelihood Ratio
Contribution            **       **       **        **        0.983    0.054    0.999     0.991     1508.260
Frozen                  **       **       **        **        0.899    0.002    0.849     0.873     978.810
Concept Completion      0.844    0.035    0.761     0.800     0.857    0.003    0.444     0.585     235.800
Interpretation          0.545    0.009    0.545     0.545     0.550    0.000    0.917     0.688     135.360
Definition              0.667    0.002    0.941     0.780     0.424    0.001    0.583     0.491     131.770
Verification            0.969    0.004    0.969     0.969     0.520    0.004    0.255     0.342     103.880
Comparison              0.955    0.011    0.778     0.857     1.000    0.004    0.132     0.233     55.460
Quantification          0.949    0.002    0.982     0.966     0.556    0.003    0.139     0.222     43.710
Expectational           0.833    0.010    0.833     0.833     1.000    0.000    0.667     0.800     33.870
Procedural              0.545    0.009    0.545     0.545     1.000    0.000    1.000     1.000     20.230
Goal Orientation        0.926    0.006    0.893     0.909     1.000    0.001    0.143     0.250     14.490
Judgmental              0.842    0.010    0.865     0.853     0.500    0.001    0.167     0.250     12.050
Disjunction             0.926    0.000    1.000     0.962     0.333    0.000    0.250     0.286     11.910
Causal Antecedent       0.667    0.017    0.667     0.667     0.200    0.001    0.083     0.118     8.350*
Feature Specification   0.824    0.006    0.824     0.824     0.000    0.000    0.000     0.000     0.000*
Enablement              0.875    0.006    0.903     0.889     0.000    0.000    0.000     0.000     0.000*
Causal Consequent       0.811    0.008    0.882     0.845     0.000    0.000    0.000     0.000     0.000*
Example                 0.950    0.008    0.826     0.884     **       **       **        **        **

Table 4. Training data and AutoTutor results.

Skewed categories bias the kappa statistic to low values even when the
proportion of rater agreement is very high (Feinstein and Cicchetti
1990a; Feinstein and Cicchetti 1990b). In the contribution group,
judges can expect to see mostly one category, contribution, whereas
judges in the non-contribution group can expect to see the other 17
categories. Expected agreement by chance for the contribution group was
98%. Correspondingly, inter-rater reliability using the kappa statistic
was low for the contribution group, .5, despite 99% proportion
agreement, and high for the non-contribution group, .93.
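
The effect of skew on kappa can be reproduced from the figures above using the standard definition, kappa = (observed agreement - chance agreement) / (1 - chance agreement). The function name is hypothetical; the input values are the ones reported for the contribution group.

```python
def kappa(observed_agreement, chance_agreement):
    """Chance-corrected agreement between raters."""
    return (observed_agreement - chance_agreement) / (1 - chance_agreement)


# Contribution group: 99% observed agreement, 98% expected by chance.
contribution_kappa = kappa(0.99, 0.98)
```

With 98% agreement expected by chance, only two percentage points remain to be explained, so a single point of disagreement halves kappa to about .5 even though raters agreed on 99% of utterances.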

However, the .93 inter-rater agreement can be extended to all of the
utterance categories. Due to classifier error, the non-contribution
group consisted of 38% contributions. Thus the .93 agreement applies to
contributions in this group. Equal proportion of agreement for
contribution classifications in both groups, 99%, suggests that the
differences in kappa solely reflect differences in category skew across
groups. Under this analysis, dividing the utterances into two groups
improved the distribution of categories for the calculation of kappa
(Feinstein and Cicchetti 1990b).


Expert judges classified questions with a .93 kappa, which supports a
monothetic classification scheme for this application. In Section 3 the
possibility was raised of a polythetic scheme for question
classification, i.e. one in which two categories could be assigned to a
given question. If a polythetic scheme were truly necessary, one would
expect inter-rater reliability to suffer in a monothetic classification
task. High inter-rater reliability on the monothetic classification
task renders polythetic schemes superfluous for this application.

The recall column for evaluation in Table 4 is generally much higher
than corresponding cells in the precision column. The disparity implies
a high rate of false positives for each of the categories. One possible
explanation is the reconstruction algorithm applied during
classification. It was observed that, particularly in the language of
physics, students used question stems in utterances that were not
questions, e.g. "The ball will land when ..." Such falsely
reconstructed questions account for 40% of the questions detected by
the classifier. Whether modifying the reconstruction algorithm would
improve F-measure, i.e. improve precision without sacrificing recall,
is a question for future research.

The distribution of categories is highly skewed: 97% of the utterances
were contributions, and example questions never occurred at all. In
addition to recall, fallout, precision, and F-measure, significance
tests were calculated for each category's contingency table to ensure
that the cells were statistically significant. Since most of the
categories had at least one cell with an expected value of less than 1,
Fisher's exact test is more appropriate for significance testing than
likelihood ratios or chi-square (Pedersen 1996). Those categories that
are not significant are starred; all other categories are significant,
p < .001.
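
Fisher's exact test on a 2 x 2 contingency table can be computed from the hypergeometric distribution. The sketch below is a one-sided version (testing for positive association between classifier and expert); the paper does not say whether a one-sided or two-sided test was used, so that choice and the function name are assumptions.

```python
from math import comb


def fisher_exact_greater(tp, fp, fn, tn):
    """One-sided Fisher's exact test p-value for a 2 x 2 table.

    Sums the hypergeometric probabilities of every table with the same
    margins whose tp cell is at least as large as the observed one.
    """
    classifier_present = tp + fp  # first row total
    expert_present = tp + fn      # first column total
    n = tp + fp + fn + tn
    denominator = comb(n, expert_present)
    largest_tp = min(classifier_present, expert_present)
    numerator = sum(
        comb(classifier_present, k) * comb(n - classifier_present, expert_present - k)
        for k in range(tp, largest_tp + 1)
    )
    return numerator / denominator
```

Because the computation is exact, it remains valid when expected cell counts fall below 1, the situation that rules out chi-square here.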

Though not appropriate for hypothesis testing in this instance,
likelihood ratios provide a comparison of classifier performance across
categories. Likelihood ratios are particularly useful when comparing
common and rare events (Dunning 1993; Plaunt and Norgard 1998), making
them natural here given the rareness of most question categories and
the frequency of contributions. The likelihood ratios in the rightmost
column of Table 4 are on a natural logarithmic scale, $-2\ln\lambda$,
so procedural at $e^{.5 \times 20.23} = 24712$ is more likely than goal
orientation, at $e^{.5 \times 14.49} = 1401$, with respect to the base
rate, or null hypothesis.
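
A log-likelihood ratio of this kind can be computed for a 2 x 2 table as twice the sum of observed counts times the log of observed over expected counts, and converted back from the logarithmic scale by exponentiating half its value. The sketch below assumes that standard formulation; the function names are hypothetical and the paper does not give the underlying cell counts.

```python
from math import exp, log


def log_likelihood_ratio(tp, fp, fn, tn):
    """Return -2 ln(lambda) for a 2 x 2 contingency table."""
    observed = [[tp, fp], [fn, tn]]
    row_totals = [tp + fp, fn + tn]
    col_totals = [tp + fn, fp + tn]
    n = tp + fp + fn + tn
    total = 0.0
    for i in range(2):
        for j in range(2):
            count = observed[i][j]
            if count > 0:  # empty cells contribute nothing to the sum
                expected = row_totals[i] * col_totals[j] / n
                total += count * log(count / expected)
    return 2 * total


def relative_likelihood(ratio):
    """Undo the -2 ln scale: exp(.5 * ratio) equals 1 / lambda."""
    return exp(0.5 * ratio)
```

Applied to the Table 4 entries, relative_likelihood(14.49) comes to roughly 1401 for goal orientation, and relative_likelihood(20.23) to roughly 24,700 for procedural questions.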

To judge overall performance on the AutoTutor sessions, an average
weighted F-measure may be calculated by summing the products of all
category F-measures with their frequencies:

\[ \text{Average weighted F-measure} = \sum_{c} \text{F-measure}_c \times \frac{tp_c + fn_c}{N} \]

where the sum runs over the categories c and N is the total number of
utterances.

The average weighted F-measure reflects real world performance since
accuracy on frequently occurring classes is weighted more. The average
weighted F-measure for the evaluation data is .98, mostly due to the
great frequency of contributions (.97 of all utterances) and the high
associated F-measure. Without weighting, the average F-measure for the
significant cells is .54.
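
The weighting can be illustrated with a short sketch. The function name and the counts in the usage note are hypothetical; the paper reports only the resulting averages, not per-category counts.

```python
def weighted_f_measure(categories):
    """Average F-measure weighted by category frequency.

    categories is a list of (f_measure, tp, fn) tuples, one per
    category; tp + fn is the number of utterances the expert assigned
    to that category, so the weights sum to 1 over all utterances.
    """
    total = sum(tp + fn for _, tp, fn in categories)
    return sum(f * (tp + fn) for f, tp, fn in categories) / total
```

For example, with made-up counts, a category holding 97 of 100 utterances at an F-measure of 0.99 and one holding the remaining 3 at 0.50 give weighted_f_measure([(0.99, 97, 0), (0.50, 2, 1)]) of about 0.975, which shows how a dominant category pulls the weighted average toward its own score.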

With respect to the three applications mentioned, i) tracking student
understanding, ii) mixed-initiative dialogue, and iii) question
answering, the classifier is doing extremely well on the first two and
adequately on the last. The first two applications for the most part
require distinguishing questions from contributions, which the
classifier does extremely well, F-measure = .99. Question answering, on
the other hand, can benefit from more precise identification of the
question type, and the average unweighted F-measure for the significant
questions is .48.

7  Conclusion 

One of the objectives of this work was to see how well a classifier
could perform with a minimum of resources. Using no context and only
surface features, the classifier performed with an average weighted
F-measure of .98 on real world data.


However, the question remains how performance will fare as rare
questions become more frequent. Scaffolding student questions has
become a hot topic recently (Graesser et al. 2003). In a system that
greatly promotes question-asking, the weighted average of .98 will tend
to drift closer to the unweighted average of .54. Thus there is clearly
more work to be done.

Future directions include using bootstrapping methods and statistical
techniques on tutoring corpora and using context to disambiguate
question classification.